Abstract

This paper outlines a technique for differentially weighting options
of a multiple-choice test in a fashion that maximizes the item predictive
validity. The rule can be applied with different numbers of categories,
and the "optimal" number of categories can be determined by significance
tests and/or through the $R^2$ criterion. Our theoretical analysis indicates
that more complex scoring rules have higher item validities, higher
item variances, higher score variances, and are also likely to increase
the interitem correlations and the test reliability. A plausible
explanation for the apparent paradox of lack of improvement in the test
validity, based on the relation between interitem correlations and item
validities, is offered.

Differential Weighting of Multiple-Choice Items

Background

The question of differential weighting of multiple-choice items has
generated a large number of studies in the psychological and educational
literature (see Stanley & Wang 1970 and Wang & Stanley 1970 for reviews).
The bulk of the literature suggests that assigning different weights to
the items does not significantly affect the test characteristics and
performance, but the possibility of differentially weighting the options
(distracters) of any given item has some attractive aspects. As a
result, several studies comparing and evaluating a variety of procedures
of Differential Options Weighting (DOW) have been conducted in recent
years (e.g. Hendrickson 1971, Ramsay 1968, Reilly & Jackson 197?,
Echternacht 1976, Bejar & Weiss 1977, Donlon & Fitzpatrick 1978). These
studies suggest that the use of scoring procedures more complex than the
regular 0-1 rule has a beneficial effect on some of the test
characteristics. When these weights were applied to real and artificial
data, indices of reliability and internal consistency improved.
With one exception, however (Echternacht 1976), no significant
improvement in the predictive validity of the tests has been reported.

This fact is surprising. One would expect that when the information
conveyed by each item is more complete and better measured, the predictive
validity of both item and test will be increased. In this paper we
offer a theoretical analysis of the effects of differential weighting on
validity. By validity we refer to the prediction of an external criterion
independently measured. By taking this approach we eliminate the item-test
regression (often labelled the discriminating power of the item), which
is a special case of validity. We will comment on this problem in a
separate section. We examine a procedure which has the property of
maximizing the item-criterion correlation. Therefore any other non-optimal
DOW, or regular scoring rule, can be evaluated by comparing its prescribed
item weights to the optimal weights. Such a rule provides an indication
of how well a scoring rule can be expected to improve the prediction of
the criterion and provides a meaningful standard of comparison for any
other alternative non-optimal procedure. It should be emphasized that
optimality here refers to item validity only, and that the rule may have
damaging effects (at least theoretically) on other aspects of the items
and the test. We will also examine some of the side effects of this
technique, which will enable us to better assess its overall performance.

Definition of the problem and some notation

Imagine we have a quantitative criterion, X, which we want to
predict by a multiple-choice test, Y, containing k items
($Y_1, Y_2, \ldots, Y_k$). Without any loss of generality, we can assume
that the scores of X are scaled, or grouped, in a finite number of values
(C). Therefore any person taking X has a score such that:

$$ 0 < X < C \tag{1} $$

A typical item, $Y_g$, has a options: one correct and (a-1) incorrect.
Since not every examinee attempts to answer all items we must define an
additional category for omissions. We consider this category to be as
important, meaningful and informative as the other a options. If we let
r = (a+1), we can represent the responses of all the examinees to a given
item in an r x C contingency table. Each row represents one option of the
item, each column represents one level of score on the criterion X, and
the typical entry in the table, $n_{ij}$, is the number of people with score
$X_j$ who selected option i. Following the regular statistical notation we
let $n_{.j}$, $n_{i.}$ and $n$ denote the marginal column, row and total
frequencies, respectively. At this stage we need to select an index of
association to describe the relation between X and $Y_g$ as reflected by
the contingency table. By direct analogy to the dichotomous scoring rule,
the multinomial generalizations of the biserial and point biserial
correlations suggest themselves as possible candidates. Indeed Donlon &
Fitzpatrick (1978) have already proposed using the multiserial correlation
(Jaspen 19?6) as a generalized discrimination index. For our purposes we
prefer the point multiserial coefficient (Das Gupta 1960, Hamdan &
Schulman 1975) for several reasons:

(i) Unlike the multiserial, it is a PRE measure (Costner 1965), i.e.
$R^2_{pms}$ can be interpreted as the percentage of variance of X
accounted for by $Y_g$.

(ii) Unlike the multiserial, its values are bounded, i.e.
$-1 \le R_{pms} \le 1$.

(iii) Unlike the multiserial, the weights assigned to the different
categories of $Y_g$ are not determined by any distributional assumption.

(iv) These weights can be selected in a way that maximizes the linear
relationship between X and $Y_g$. These weights ($Y_{gi}$) are a linear
function of the mean criterion score of the examinees selecting
option i. In particular, if we let $\bar{X}_{i.}$ be the mean score of the
people who selected the i-th option (i = 1, ..., r):

$$ \bar{X}_{i.} = \Big(\sum_{j} n_{ij} X_j\Big) / n_{i.} \tag{2} $$

then the optimal weights are given by (Das Gupta 1960):

$$ Y_{gi} = A \bar{X}_{i.} + B \tag{3} $$

If we select A = 1 and B = 0 we can express the point multiserial index
in a very convenient form (Hamdan & Schulman 1975):

$$ R^2_{pms} = \frac{\frac{1}{n}\sum_{i} n_{i.}\bar{X}_{i.}^2 - \Big(\frac{1}{n}\sum_{i} n_{i.}\bar{X}_{i.}\Big)^2}{\frac{1}{n}\sum_{j} n_{.j}X_j^2 - \Big(\frac{1}{n}\sum_{j} n_{.j}X_j\Big)^2} \tag{4} $$

This particular weighting has two attractive properties:

(a) As Das Gupta (1960) points out, the squared optimal point
multiserial is equal to the squared multiserial eta (Wherry & Taylor
1946).

(b) $R_{pms}$ can be expressed as a ratio of two standard deviations
(Hamdan & Schulman 1975):

$$ R_{pms} = S(\bar{X}_{i.}) / S(X) \tag{5} $$
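
As a rough illustration of equations (2) to (5), the following Python sketch computes the option means (the optimal weights with A = 1 and B = 0) and the squared point multiserial from an r x C table. The table entries, the criterion levels and the function names are hypothetical and are not taken from the paper.

```python
def option_means(table, x_values):
    """Mean criterion score of the examinees choosing each option (eq. 2)."""
    return [sum(n * x for n, x in zip(row, x_values)) / sum(row)
            for row in table]


def r2_pms(table, x_values):
    """Squared point multiserial under the optimal weights (eq. 4)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    means = option_means(table, x_values)
    grand = sum(c * x for c, x in zip(col_tot, x_values)) / n
    # Numerator of (4): variance of the option means, weighted by n_i.
    between = sum(r * m * m for r, m in zip(row_tot, means)) / n - grand ** 2
    # Denominator of (4): variance of the criterion X.
    total = sum(c * x * x for c, x in zip(col_tot, x_values)) / n - grand ** 2
    return between / total


# Hypothetical 4 x 5 table: rows are the correct option, two distracters
# and "omit"; columns are the criterion scores 1..5.
TABLE = [
    [2, 4, 8, 12, 14],
    [6, 6, 5, 3, 1],
    [8, 5, 3, 2, 1],
    [4, 5, 4, 3, 4],
]
X = [1, 2, 3, 4, 5]

weights = option_means(TABLE, X)  # optimal weights with A = 1, B = 0
print([round(w, 3) for w in weights], round(r2_pms(TABLE, X), 4))
```

Because the weights are the option means themselves, the square root of the returned value is the ratio of standard deviations in (5).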



A model for evaluating the effects of DOW

It seems only natural to compare the optimal scoring rule to the
regular dichotomous alternative (1 = right, 0 = wrong). This indeed is
easily done within the framework of this model. Note that if the number
of categories, r, is reduced to 2, then the point multiserial is just
the regular point biserial. Furthermore it is well known (Das Gupta
1960) that if r = 2, $R_{pms}$ is invariant to linear transformations of Y
and X. In other words, the correlation will not be changed if we
replace the 0-1 weights by the optimal weights. The implication is
obvious: one can compare the effectiveness of the two scoring rules by
the percentage of variance accounted for when 2 or r categories are
used and, if some distributional assumptions are made, test whether the
difference is significant. But note that scoring by 2 or r categories
are only end points on a continuum of different optimal scoring rules.
We can define a hierarchy of models (all of them optimal) which vary in
terms of their complexity and of the number of categories used by the
scoring procedure. Consider the following models:

(i) r categories - all r options.

(ii) (p+1) = (r-q+1) options - q categories are combined into one
while p are left unchanged.

(iii) 3 options - right, wrong, omit.

(iv) 2 options - right, wrong + omit.

Models (i), (iii), and (iv) are natural and well known. We need to say
a word about (ii). It defines a class of models in which two or more
options are combined on the basis of empirical or theoretical
justifications. If one option is selected with very low probability it
may be reasonable to combine it with the "omit" option. If there is some
natural relation between some of the distracters it may seem natural to
combine them according to this characteristic (see Echternacht 1976 for
such items), etc. The most important point is that the responses can be
scored in a variety of ways, using different numbers of categories, and
for each model optimal weights can be easily derived by the same rule
(3). One could compare all these models and select the best one, i.e.
the one which predicts the highest proportion of variance in X relative
to the number of parameters fitted (the number of categories).



We now examine the effect on the correlation of combining q categories
into one while keeping the first p unchanged. Define the new category
$Y_c$, and also define:

$$ n_{c.} = \sum_{i=p+1}^{r} n_{i.} \tag{6} $$

$$ \bar{X}_{c.} = \Big(\sum_{i=p+1}^{r} n_{i.}\bar{X}_{i.}\Big) / n_{c.} \tag{7} $$

These manipulations do not affect the denominator and the second term in
the numerator of (4). The first term in the numerator can be rewritten
as:

$$ \frac{1}{n}\Big[\sum_{i=1}^{p} n_{i.}\bar{X}_{i.}^2 + n_{c.}\bar{X}_{c.}^2\Big] \tag{8} $$

and if we let $R_{pms(q)}$ be the new point multiserial correlation, it can
be easily shown that:

$$ R^2_{pms} - R^2_{pms(q)} = \frac{\sum\sum_{i<j=p+1}^{r} n_{i.}\, n_{j.}\, (\bar{X}_{i.} - \bar{X}_{j.})^2}{n\, n_{c.}\, S^2(X)} \tag{9} $$

If we only combine two categories (say k and l), this is reduced to:

$$ R^2_{pms} - R^2_{pms(2)} = \frac{n_{k.}\, n_{l.}\, (\bar{X}_{k.} - \bar{X}_{l.})^2}{n\, (n_{k.} + n_{l.})\, S^2(X)} \tag{10} $$

Eq. (9) is never negative, which implies that if one reduces the number
of categories the correlation with the criterion cannot increase, and it
decreases whenever the combined categories have different means.
The reduction in percentage of variance accounted for is a monotonically
decreasing function of the sample size, the variance of the criterion
and the size of the new category; it is a monotonically increasing
function of the weighted sum of squared pairwise differences between the
means of the q categories combined. These relations suggest that using
simpler scoring rules (combining categories) may have only a negligible
effect on the item validity when the means of the combined groups are
relatively homogeneous and the sample size and criterion variance are
large. On the other hand, if the sample size and variance of X are
small and if the means are relatively heterogeneous, the more complex
rule can significantly increase the correlation. Finally, for a given
criterion (with a fixed variance) administered to a fixed sample (fixed
n), the best way to simplify the scoring rule is to combine the categories
with the most similar means.
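
The loss in explained variance from pooling options can be checked numerically. The sketch below, with a hypothetical table and hypothetical helper names, pools selected rows of an r x C table and compares the observed drop in the squared point multiserial with the pairwise-difference expression on the right-hand side of (9).

```python
def _moments(table, x_values):
    """Sample size, grand mean and variance of the criterion X."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    grand = sum(c * x for c, x in zip(col_tot, x_values)) / n
    s2x = sum(c * x * x for c, x in zip(col_tot, x_values)) / n - grand ** 2
    return n, grand, s2x


def _row_mean(row, x_values):
    return sum(c * x for c, x in zip(row, x_values)) / sum(row)


def r2_pms(table, x_values):
    """Squared point multiserial under the optimal weights (eq. 4)."""
    n, grand, s2x = _moments(table, x_values)
    between = sum(sum(row) * _row_mean(row, x_values) ** 2 for row in table) / n
    return (between - grand ** 2) / s2x


def merge_rows(table, rows):
    """Pool the listed option rows into one category (appended last)."""
    kept = [row for i, row in enumerate(table) if i not in rows]
    pooled = [sum(col) for col in zip(*(table[i] for i in rows))]
    return kept + [pooled]


def predicted_loss(table, x_values, rows):
    """Right-hand side of eq. (9) for the pooled rows."""
    n, _, s2x = _moments(table, x_values)
    tot = [sum(table[i]) for i in rows]
    mean = [_row_mean(table[i], x_values) for i in rows]
    pair_sum = sum(tot[a] * tot[b] * (mean[a] - mean[b]) ** 2
                   for a in range(len(rows))
                   for b in range(a + 1, len(rows)))
    return pair_sum / (n * sum(tot) * s2x)


# Hypothetical table: correct option, two distracters, omit.
TABLE = [
    [2, 4, 8, 12, 14],
    [6, 6, 5, 3, 1],
    [8, 5, 3, 2, 1],
    [4, 5, 4, 3, 4],
]
X = [1, 2, 3, 4, 5]

full = r2_pms(TABLE, X)
for rows in [(1, 2), (1, 2, 3), (0, 3)]:
    loss = full - r2_pms(merge_rows(TABLE, rows), X)
    print(rows, round(loss, 6), round(predicted_loss(TABLE, X, rows), 6))
```

In this made-up table the two distracters have similar means, so pooling them costs far less than pooling the correct option with "omit".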

If we are interested in testing hypotheses about $R^2_{pms}$ we must
assume that the criterion conditional distribution at the i-th level of
$Y_g$ (i = 1, ..., r) is $N(\mu_i, \sigma^2)$ (Hamdan & Schulman 1975). In
this model we can test independence ($H_0: \rho_{pms} = 0$) for any scoring
rule with s categories (s = 2, ..., r), by:

$$ F = \frac{(n-s)\, R^2_{pms(s)}}{(s-1)\,\big(1 - R^2_{pms(s)}\big)} \tag{11} $$

This statistic has an F distribution with (s-1) and (n-s) d.f. under the
null hypothesis. To test equality of two models with $s_1$ and $s_2$
categories ($H_0: \rho_{pms(s_1)} = \rho_{pms(s_2)}$) we can use the statistic:

$$ F = \frac{\big[R^2_{pms(s_1)} - R^2_{pms(s_2)}\big]\,(n - s_1)}{\big[1 - R^2_{pms(s_1)}\big]\,(s_1 - s_2)} \tag{12} $$

which is distributed as an F with $(s_1 - s_2)$ and $(n - s_1)$ d.f. under
$H_0$. Generally, for each item, $Y_g$, a series of tests similar to those
performed in a standard regression analysis can be used in order to assess
the best scoring rule and its effectiveness (Cramer 1972).
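
A minimal sketch of the two test statistics follows, assuming the squared correlations have already been computed for each scoring rule. The function names and the numerical inputs are hypothetical; the functions return only the F value and its degrees of freedom, leaving the p-value to any F-distribution routine.

```python
def f_independence(r2, n, s):
    """Eq. (11): F statistic and d.f. for H0: rho_pms = 0 with s categories."""
    if not 2 <= s < n:
        raise ValueError("need 2 <= s < n")
    f_value = (r2 * (n - s)) / ((s - 1) * (1.0 - r2))
    return f_value, (s - 1, n - s)


def f_nested(r2_full, r2_reduced, n, s1, s2):
    """Eq. (12): F statistic and d.f. comparing s1 categories against s2 < s1."""
    if not s2 < s1 < n:
        raise ValueError("need s2 < s1 < n")
    f_value = ((r2_full - r2_reduced) * (n - s1)) / ((1.0 - r2_full) * (s1 - s2))
    return f_value, (s1 - s2, n - s1)


# Hypothetical item: 100 examinees, 4 scored categories versus right/wrong.
print(f_independence(0.20, 100, 4))
print(f_nested(0.20, 0.15, 100, 4, 2))
```

The nested comparison only makes sense when the simpler rule is obtained by combining categories of the more complex one, as in the hierarchy of models described above.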



The effect of different scoring rules on other item and test characteristics

(a) Item variance

Once the weights to be attached to the r options are determined we
can calculate the item's variance by the regular formula:

$$ S^2_g = \frac{1}{n}\sum_{i=1}^{r} n_{i.} Y_{gi}^2 - \Big(\frac{1}{n}\sum_{i=1}^{r} n_{i.} Y_{gi}\Big)^2 \tag{13} $$

which is just a re-expression of the numerator of (4). Therefore, after
combining q categories into one, the reduction in the item's variance is:

$$ S^2_g - S^2_{g(q)} = \frac{\sum\sum_{i<j=p+1}^{r} n_{i.}\, n_{j.}\, (Y_{gi} - Y_{gj})^2}{n\, n_{c.}} \tag{14} $$

The sum of squared pairwise differences in (14) is just a linear
function of the variance of the means in the combined categories around
$\bar{X}_{c.}$. Therefore, (14) indicates that when a simpler scoring rule is
employed the variance of the responses in each item is invariably reduced,
and this reduction is proportional to the variance of the optimal weights
and inversely related to sample size. Minimal reduction in variance for
any given item will be obtained when we combine categories with
homogeneous means.






(b) Interitem correlation

Consider two arbitrary items in the test, $Y_g$ and $Y_h$, scored on all
categories. Given their correlations with the criterion, $R_{xg}$ and
$R_{xh}$ (we drop the pms notation for simplicity), their intercorrelation
is restricted by (Glass & Collins 1970):

$$ R_{gh} = R_{xg}R_{xh} \pm \sqrt{(1-R^2_{xh})(1-R^2_{xg})} \tag{15} $$

We consider the effect of combining q categories on this interval. Let
ll and ll(q) denote the lower limits of the interval when the items are
scored with r and (p+1) categories, respectively. The difference between
these lower bounds is:

$$ ll - ll(q) = \big[R_{xg}R_{xh} - R_{xg}(q)R_{xh}(q)\big] + \Big[\sqrt{(1-R^2_{xg}(q))(1-R^2_{xh}(q))} - \sqrt{(1-R^2_{xh})(1-R^2_{xg})}\Big] \tag{16} $$

Since it was shown that $R^2_{xs} \ge R^2_{xs}(q)$ (s = 1, ..., r), and
consequently that $(1-R^2_{xs}) \le (1-R^2_{xs}(q))$, it follows that eq. (16)
is always positive: combining categories reduces the lower bound for the
interitem correlation.

Let lc and lc(q) represent the length of the interval, or in other
words the range of values that $R_{gh}$ can take, when r or (p+1) categories
are used. It can be shown that:

$$ lc(q) - lc = 2\Big[\sqrt{(1-R^2_{xg}(q))(1-R^2_{xh}(q))} - \sqrt{(1-R^2_{xh})(1-R^2_{xg})}\Big] \tag{17} $$

The range of possible values of $R_{gh}$ is increased or, in other words,
the restrictions imposed on the internal relations between items through
their correlations with the external criterion X are relaxed. The lower
bound and the length of the interval determine its upper limit (ul).
Combining the information from (16) and (17) it follows that:

$$ ul(q) > ul \quad \text{if} \quad [lc(q) - lc] > [ll - ll(q)] \tag{18} $$

It becomes clear that the upper limit of the interval can increase,
decrease or remain unchanged depending on the nature and magnitude of
the changes in the item-criterion correlations. This is a particularly
interesting result because it demonstrates one possible explanation for
the lack of improvement in validity of a test. Although DOW improves
the individual item validities, it can also simultaneously increase the
interitem correlations, and the overall system validity can remain
practically unchanged.
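
The behavior of the interval in (15) can be explored with a short sketch. The item validities below are hypothetical and the function name is an assumption of this example; the code simply evaluates the limits before and after a reduction in both validities and checks the condition in (18).

```python
from math import sqrt


def rgh_interval(r_xg, r_xh):
    """Lower limit, upper limit and length of the interval in eq. (15)."""
    centre = r_xg * r_xh
    half = sqrt((1.0 - r_xg ** 2) * (1.0 - r_xh ** 2))
    return centre - half, centre + half, 2.0 * half


# Hypothetical validities with r categories, then after combining categories.
ll, ul, lc = rgh_interval(0.50, 0.40)
ll_q, ul_q, lc_q = rgh_interval(0.45, 0.35)

print("lower limit:", round(ll, 4), "->", round(ll_q, 4))
print("length:     ", round(lc, 4), "->", round(lc_q, 4))
print("upper limit:", round(ul, 4), "->", round(ul_q, 4))
# Condition (18): the upper limit rises only if the gain in length
# exceeds the drop in the lower limit.
print((lc_q - lc) > (ll - ll_q), ul_q > ul)
```

Trying other pairs of validities shows that the upper limit can move in either direction, while the lower limit and the length move as stated in the text whenever both validities are positive and shrink.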

If we assume that the values of $R_{gh}$ are symmetrically distributed
within the interval, its expected value is at the central point (see
Mulaik 1976 for an elaborate proof for the special case
$R_{xh} = R_{xg} = R$). If we use an r categories scoring system:

$$ E[R_{gh} \mid R_{xh}; R_{xg}] = R_{xh}R_{xg} \tag{19} $$

and for (r-q+1) categories:

$$ E[R_{gh}(q) \mid R_{xh}(q); R_{xg}(q)] = R_{xh}(q)R_{xg}(q) \tag{20} $$

Therefore we can write:

$$ E[R_{gh} \mid R_{xh}; R_{xg}] - E[R_{gh}(q) \mid R_{xh}(q); R_{xg}(q)] = R_{xh}R_{xg} - R_{xh}(q)R_{xg}(q) \tag{21} $$

The expected value of the correlation between a pair of items decreases
after combining q categories. Consider the explanation for the lack of
improvement in validity offered in the previous paragraph. The last
result demonstrates that such a situation is not only possible but also
very likely to occur.

(c) Variance of scores

The score of each individual on scale Y is defined as the sum of
the scores on the k items composing the scale. Therefore the variance,
$S^2(Y)$, is given by:

$$ S^2(Y) = \sum_{i=1}^{k} S_i^2 + \sum\sum_{i \ne j = 1}^{k} S_{ij} \tag{22} $$

where $S_{ij}$ is the covariance of items i and j. We have already shown
that simplified scoring rules have the effect of invariably reducing the
item variances, standard deviations and the lower bounds of their
intercorrelation and, conditional upon the symmetric distribution of
$R_{ij}$ given $R_{ix}$ and $R_{jx}$, the expected interitem correlations.
These facts, combined together, indicate that the variance of the Y scores
is very likely to decrease when categories are combined. In fact, a
sufficient condition for this to happen is that $R_{ij}(q) < R_{ij}$
(i, j = 1, ..., k).

A special case, which will be discussed later, is one where the
combination of the categories has a uniform effect on all the items,
i.e. each item variance is reduced by the same proportion. Note that
this does not imply that the item variances are equal when r or p+1
categories are used, but rather indicates the fact that there is a
functional relation between the number of categories and the item
variances and that combining q categories has a relatively homogeneous
effect on all item variances. Formally, let $S_i^2(q) = d^2 S_i^2$
(i = 1, ..., k, 0 < d < 1), and in this case, substituting into (22):

$$ S^2(Y) - S^2(Y)(q) = (1-d^2)\sum_{i=1}^{k} S_i^2 + \sum\sum_{i \ne j = 1}^{k} S_i S_j\,\big(R_{ij} - d^2 R_{ij}(q)\big) \tag{23} $$



(d) Reliability

A popular method of calculating reliability is to obtain the ratio
of the mean interitem covariance and the mean item variance, $R_{yy}$, and
to use it as an estimator of the reliability of a single item in the
Spearman-Brown prophecy formula (Stanley 1971). If the score is based
on r categories:

$$ R_{yy} = \frac{k \sum\sum_{i \ne j = 1}^{k} S_i S_j R_{ij}}{(k-1)^2 \sum_{l=1}^{k} S_l^2} \tag{24} $$

and if only (p+1) are used:

$$ R_{yy}(q) = \frac{k \sum\sum_{i \ne j = 1}^{k} S_i(q) S_j(q) R_{ij}(q)}{(k-1)^2 \sum_{l=1}^{k} S_l^2(q)} \tag{25} $$

The difference between the two estimates can be written as:

$$ R_{yy} - R_{yy}(q) = \frac{k \Big[\sum_{l=1}^{k}\sum\sum_{i \ne j = 1}^{k} \big(S_l^2(q)\, S_i S_j R_{ij} - S_l^2\, S_i(q) S_j(q) R_{ij}(q)\big)\Big]}{(k-1)^2 \Big[\sum_{l=1}^{k} S_l^2 \sum_{l=1}^{k} S_l^2(q)\Big]} \tag{26} $$

It appears that the effect of the scoring rule on the reliabilities
depends on the pattern of variances, covariances and their respective
reductions. To simplify formula (26) we assume that when categories are
combined the variance of each item is reduced by an amount proportional
to its initial magnitude, i.e. $S_i(q) = d S_i$ (i = 1, ..., k, 0 < d < 1).
In this case:

$$ R_{yy} - R_{yy}(q) = \frac{k \Big[\sum_{l=1}^{k}\sum\sum_{i \ne j = 1}^{k} S_l^2\, S_i S_j \big(R_{ij} - R_{ij}(q)\big)\Big]}{(k-1)^2 \Big[\sum_{l=1}^{k} S_l^2\Big]^2} \tag{27} $$

The amount of reduction in the test reliability is independent of the
constant d, and it is proportional to the weighted sum of reductions in
item intercorrelations which were discussed in a previous section. The
direct relation between the reliability of a test and the mean item
intercorrelation was demonstrated empirically in a recent paper by Bejar
and Weiss (1977).
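
The independence from d can be illustrated numerically. The sketch below follows the verbal definition (mean interitem covariance divided by mean item variance) rather than any particular normalizing constant, which is an assumption of this example; the standard deviations, the correlation matrices and the function names are hypothetical.

```python
def single_item_reliability(sds, corr):
    """Mean interitem covariance divided by mean item variance."""
    k = len(sds)
    mean_var = sum(s * s for s in sds) / k
    cov_sum = sum(sds[i] * sds[j] * corr[i][j]
                  for i in range(k) for j in range(k) if i != j)
    return (cov_sum / (k * (k - 1))) / mean_var


def spearman_brown(r_single, k):
    """Spearman-Brown prophecy formula for a test of k items."""
    return k * r_single / (1.0 + (k - 1) * r_single)


# Hypothetical three-item test scored with r categories (CORR) and with
# combined categories (CORR_Q, slightly lower interitem correlations).
SDS = [1.0, 1.2, 0.8]
CORR = [[1.0, 0.40, 0.30], [0.40, 1.0, 0.35], [0.30, 0.35, 1.0]]
CORR_Q = [[1.0, 0.34, 0.26], [0.34, 1.0, 0.30], [0.26, 0.30, 1.0]]


def reliability_drop(d):
    """Drop in the single-item estimate when every S_i shrinks to d * S_i."""
    shrunk = [d * s for s in SDS]
    return (single_item_reliability(SDS, CORR)
            - single_item_reliability(shrunk, CORR_Q))


print(reliability_drop(0.9), reliability_drop(0.5))
print(spearman_brown(single_item_reliability(SDS, CORR), len(SDS)))
```

Both calls to `reliability_drop` give the same value up to rounding, because a common factor d cancels between the covariances and the variances; only the change in the interitem correlations matters.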

(e) Test validity

We now combine some of the results from the previous sections in
order to examine the behavior of the validity of Y ($R_{xy}$). Gulliksen
(1950 p. 382) gives the formula for the total test validity as a
function of the item validities and the test variance:

$$ R_{xy} = \Big[\sum_{g=1}^{k} R_{xg} S_g\Big] / S(Y) \tag{28} $$

After combining q categories the validity becomes:

$$ R_{xy}(q) = \Big[\sum_{g=1}^{k} R_{xg}(q) S_g(q)\Big] / S(Y)(q) \tag{29} $$

and the reduction in the percentage of variance of the criterion explained
by the predictor is:

$$ R^2_{xy} - R^2_{xy}(q) = \frac{S^2(Y)(q)\sum_{g=1}^{k} S_g^2 R^2_{xg} - S^2(Y)\sum_{g=1}^{k} S_g^2(q) R^2_{xg}(q)}{S^2(Y)\, S^2(Y)(q)} \tag{30} $$

Using again the assumption of uniform reduction in variance across
items ($S_i(q) = d S_i$) we can rewrite the last equation as a function of
variances and correlations:

$$ R^2_{xy} - R^2_{xy}(q) = \frac{\sum_{l}\sum_{g} S_l^2 S_g^2 \big(R^2_{xg} - R^2_{xg}(q)\big) + \sum_{g}\sum\sum_{i \ne j} S_g^2 S_i S_j \big(R^2_{xg} R_{ij}(q) - R^2_{xg}(q) R_{ij}\big)}{\Big[\sum_{l} S_l^2 + \sum\sum_{i \ne j} S_i S_j R_{ij}\Big]\Big[\sum_{l} S_l^2 + \sum\sum_{i \ne j} S_i S_j R_{ij}(q)\Big]} \tag{31} $$

Note that the second term in the numerator involves the item-test as
[...] evaluate the impact of the new scoring rule on the validity. While the
first term in the numerator is always positive, the second can also
assume negative values. In fact, if we assume that all correlations
with the criterion are reduced by an amount proportional to their initial
value ($R_{ij}(q) = d R_{ij}$, i ≠ j = 1, ..., k, 0 < d < 1), the second term
vanishes. Equation (31) provides further support to the explanation
offered in the previous section for the lack of improvement in validity.
It is clear that the overall improvement in validity depends on the effect
of the scoring procedure on both the item correlations and interitem
correlations. We can expect a significant gain in the percentage of
variance predicted in tests in which we can significantly improve the item
validities and reduce the interitem correlations (or at least not increase
them). This is more likely to happen if the initial item validities are low.
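
The trade-off between item validities and interitem correlations can be made concrete with Gulliksen's formula (28). In the sketch below all numbers are hypothetical, the items are given equal standard deviations and equal intercorrelations for simplicity, and the function names are assumptions of this example.

```python
from math import sqrt


def composite_validity(validities, sds, corr):
    """Eq. (28): sum of R_xg * S_g over S(Y), with S^2(Y) taken from eq. (22)."""
    k = len(sds)
    # corr has ones on the diagonal, so this sum includes the item variances.
    var_y = sum(sds[i] * sds[j] * corr[i][j]
                for i in range(k) for j in range(k))
    return sum(r * s for r, s in zip(validities, sds)) / sqrt(var_y)


def equicorrelated(k, r):
    """k x k correlation matrix with every off-diagonal entry equal to r."""
    return [[1.0 if i == j else r for j in range(k)] for i in range(k)]


SDS = [1.0, 1.0, 1.0]

# 0-1 scoring: item validities .30, interitem correlations .20.
base = composite_validity([0.30] * 3, SDS, equicorrelated(3, 0.20))
# Weighted options raise item validities; interitem correlations unchanged.
dow_same = composite_validity([0.33] * 3, SDS, equicorrelated(3, 0.20))
# Same gain in item validities, but interitem correlations rise to .35.
dow_up = composite_validity([0.33] * 3, SDS, equicorrelated(3, 0.35))

print(round(base, 4), round(dow_same, 4), round(dow_up, 4))
```

With these made-up values the gain in item validity carries over to the test only when the interitem correlations stay put; when they rise as well, the test validity is practically unchanged.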

Final Remarks

In the introduction we have emphasized that the function being
optimized is the item-criterion correlation, and that an external and
independently measured criterion is necessary. We are not aware of any
empirical or theoretical study in which the procedure examined here was
used, although French (1952) has pointed out some of its desirable
properties. However, several studies (e.g. Hendrickson 1971, Echternacht
1976) have used a similar technique. The main difference between their
approach and the present one is that, instead of an external criterion,
they use the score on the remaining (k-1) items of the test and therefore,
instead of optimizing external validity, they optimize internal
consistency. A problem in this approach is that the two variables being
correlated are not experimentally independent: the weights for item $Y_g$
depend on the scores on the other (k-1) items, and these scores depend on
the optimal weights. One solution to this problem is to use an iterative
procedure in which the weights and the criterion are recalculated until
the increase in reliability does not exceed a fixed prespecified value.
Typically the convergence was found to be very quick and the improvement
in reliability only marginal. What are the implications of these findings
for the procedure outlined here? It is hard to judge, but there are good
reasons to believe that using an external criterion to determine the
weights should yield better results. In the iterative procedure the
initial weights are either (0, 1) or [...]. Note that these are
the most non-optimal weights, since it was pointed out that the
improvement in item validity is proportional to squared differences
between the means. The internal consistency procedure is likely to improve
its performance if different starting values are used. Possible candidates
for this role seem to be (a) option-test point biserial or biserial
correlations, (b) theoretically determined a priori weights, or (c)
weights proportional to the means calculated from a second independent
sample. Empirical work comparing these different starting points for
the iterative algorithm and the procedure outlined above is needed.

We have outlined a technique for differentially weighting options
of a multiple-choice test in a fashion that maximizes the item predictive
validity. The rule can be applied with different numbers of categories,
and the "optimal" number of categories can be determined by
significance tests and/or through the $R^2$ criterion. Our theoretical
analysis indicates that more complex scoring rules have higher item
validities, higher item variances, higher score variances, and are also
likely to increase the interitem correlations and the test reliability.
A plausible explanation for the apparent paradox of lack of improvement
in the test validity, based on the relation between interitem correlations
and item validities, was offered.

The mechanism suggested as the cause of this phenomenon was developed
within the framework of the particular optimization procedure
examined in this study. Yet similar explanations could be offered for
other DOW procedures, since all of them are developed at the item level
and do not account for the interitem relations.

Overall, it appears that the key to the success of any DOW procedure
is in the nature of the test's items. A scoring rule is likely to be
successful if a test contains items with distracters which can
differentiate between various levels of partial information, i.e.
distracters that have differential appeal for different ability levels.
If all the distracters are relatively homogeneous, this procedure (or any
other DOW technique) is not likely to be successful. Therefore we
speculate that DOW has a higher probability of success in achievement and
criterion-referenced tests, and in tests in which the distracters are
systematically designed to reflect different levels of partial information
(e.g. Echternacht 1976). More theoretical and empirical work on this
question is necessary.